Abstract


Numerous applications in the area of computer system analysis can be effectively 
studied with Markov reward models. These models describe the behavior of the system 
with a continuous-time Markov chain, where a reward rate is associated with each state. 
In a reliability/availability model, up states may have reward rate 1 and down states
may have reward rate zero associated with them. In a queueing model, the number of
jobs of a certain type in a given state may be the reward rate attached to that state. In
a combined model of performance and reliability, the reward rate of a state may be the 
computational capacity, or a related performance measure. Expected steady-state reward 
rate and expected instantaneous reward rate are clearly useful measures of the Markov 
reward model. More generally, the distribution of accumulated reward or time-averaged 
reward over a finite time interval may be determined from the solution of the Markov 
reward model. This information is of great practical significance in situations where the
workload can be well characterized (deterministically, or by continuous functions, e.g.,
distributions).

The design process in the development of a computer system is an expensive and 
long term endeavor. For aerospace applications the reliability of the computer system 
is essential, as is the ability to complete critical workloads in a well defined real time 
interval. Consequently, effective modeling of such systems must take into account both 
performance and reliability. This fact motivates our use of Markov reward models to aid 
in the development and evaluation of fault tolerant computer systems. 
1 Introduction


In this annual report we summarize our research accomplishments under the auspices of
NASA grant NAG-1-897 to develop tools and methods for the development of fast and reliable
computer/control systems.

The research effort has been focused in two directions: the development of mathematical
techniques and tools to enhance our understanding of relevant phenomena, and
the use of these tools for the analysis of problems relevant to NASA's present or long-term
needs.


2 Description of Markov Reward Models 


Discrete-state continuous-time Markov chains are commonly used in the evaluation of 
computer system performance as well as the reliability and availability of fault-tolerant
systems. Such Markov models are often solved for either the steady-state or transient state
probabilities [26, 63]. Weighted sums of state probabilities are then used to obtain measures 
of interest. In reliability/availability models the sum is taken over the set of operational 
states of the system. Since the operational states are a subset of all the possible states, the 
weight attached to each state is either 0 or 1. 

It is natural to extend the set of allowable weights to non-negative real numbers. For example,
when computing the average queue length in queueing models, the weight attached to a state
is a non-negative integer (the number of jobs in the queueing system). When we attach a
non-negative real number, called the reward rate, to each state of a Markov chain, we obtain
a Markov reward process. A second extension is to a class of interesting cumulative measures
that cannot be obtained as a weighted sum of state probabilities. In the reliability/availability
modeling of computer systems, these cumulative measures include the distribution of interval
availability ($A_I$) and mean-time-to-failure (MTTF).
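As a small illustration of a weighted-sum measure, the sketch below computes steady-state probabilities for a hypothetical two-processor system with a single repair facility and then forms two weighted sums: one with 0/1 weights (steady-state availability) and one with the number of working processors as the reward rate (expected capacity). The model, the rates `lam` and `mu`, and the function names are assumptions chosen for illustration; they do not come from this report.

```python
def steady_state_two_processor(lam, mu):
    """Steady-state probabilities for a hypothetical 3-state repairable system.

    State k = number of working processors (2, 1 or 0).  Each working
    processor fails at rate lam; a single repair facility restores one
    processor at rate mu.  Both the model and the rates are illustrative.
    """
    # Birth-death balance equations, written relative to state 2:
    #   pi_2 * 2*lam = pi_1 * mu   and   pi_1 * lam = pi_0 * mu
    w2 = 1.0
    w1 = w2 * 2.0 * lam / mu
    w0 = w1 * lam / mu
    total = w2 + w1 + w0
    return {2: w2 / total, 1: w1 / total, 0: w0 / total}


def expected_reward(probs, reward_rates):
    """Weighted sum of state probabilities, sum_i r_i * pi_i."""
    return sum(reward_rates[state] * p for state, p in probs.items())


if __name__ == "__main__":
    pi = steady_state_two_processor(lam=0.001, mu=0.1)
    availability = expected_reward(pi, {2: 1, 1: 1, 0: 0})  # 0/1 weights
    capacity = expected_reward(pi, {2: 2, 1: 1, 0: 0})      # real-valued weights
    print(f"availability = {availability:.6f}, expected capacity = {capacity:.6f}")
```

Only the weight vector changes between the two measures; the state probabilities are computed once.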

In many environments, computer systems are expected to provide service even though
component (or subsystem) failures may have occurred. In such fault-tolerant systems,
performance and reliability are both important in determining the ability of a
system to deliver a specific amount of useful work in a finite time period. These considerations
are particularly relevant to switching systems, databases, and general purpose computer
systems where graceful degradation and on-line repair of failed subsystems are common practice.
Thus, there are two aspects of the system to be dealt with: the state-to-state (configuration-to-
configuration) changes of the system over the interval (0, t) and the performance level (reward
rate) associated with each state of the system. The evolution of the system through different
configurations is characterized by a continuous-time Markov chain (CTMC), which will be
referred to as a structure-state process. Associated with each state of the CTMC of the
structure-state process is a reward rate to represent the performance level of the system in
that state (configuration). The set of reward rates associated with the states of a
structure-state process will be referred to as a reward structure. Thus each Markov reward model
(MRM) has a structure-state process that characterizes the evolution of the system through a
set of states and a reward structure that characterizes the performance level associated with
each state.


Different applications give rise to different interpretations of the underlying CTMC and/or 
different interpretations of the reward structure superimposed on the structure-state process. 
If we interpret the reward rate to be the speed of service and the transition structure of the 
CTMC to be failure and repair of components, the time needed to accumulate a fixed amount 
of reward will be the time to complete a task with a fixed work requirement in a failure-prone 
environment. From the distribution of the task-completion-time, we can derive quantities 
such as the probability of ever completing the task or the probability of completing the task 
before a given deadline. If we interpret the structure of the CTMC as modeling the arrival 
and departure of tasks in a queueing system, and interpret the reward rate as the number of 
jobs in the queue, we can obtain the time-averaged queue length distribution. By interpreting 
the structure-state process as task arrival/departure and interpreting the reward rate as the 
portion of the server capacity allocated to a ‘tagged’ job, the completion-time distribution will 
yield the response time distribution in the queueing system. The general utility of Markov 
reward modeling thus stems from the ability to assign and interpret both the structure-state 
process and the reward structure appropriately for a wide range of situations. 

Even after interpretations of the CTMC and reward structure have been made, a wide
variety of measures may be obtained from the MRM. Choosing an appropriate measure for
an application is important. Since the computational cost of obtaining the measures varies,
the appropriate measure that is easiest to compute is generally best. Measures can characterize
system behavior in a cumulative way (total work done in a given utilization period) or at 
an instant of time. For some applications, long range equilibrium behavior is more relevant, 
while for others transient conditions in the time interval shortly after system start up are 
more important. Finally, an expected value may be acceptable to answer some questions, 
while for other questions more detailed distributional information may be required. Before 
we discuss various models and measures more fully, we introduce some standard notation for
the structure-state process CTMC, define some useful cumulative and instantaneous random
variables, and present a small expository example in the next subsection.


2.1 Notation 

The evolution of the system in time is represented by a finite-state stochastic process
$\{Z(t),\ t \ge 0\}$. Thus Z(t) is the structure-state of the system at time t and $Z(t) \in S = \{1, 2, \ldots, n\}$.
The holding times in the structure-states are exponentially distributed and hence Z(t) is a
homogeneous CTMC. Even in situations where the holding times are generally distributed,
they may often be acceptably approximated using a finite number of exponential phases
[15, 29]. We let $q_{ij}$, $1 \le i, j \le n$, be the infinitesimal transition rate from state i to state j,
and $Q = [q_{ij}]$ is the n by n generator matrix where

$$q_{ii} = -\sum_{j=1,\ j \ne i}^{n} q_{ij} .$$

For the sake of clarity we also define $q_i = -q_{ii}$. A fixed reward rate $r_i$ is associated with each
structure-state i, and the vector r defines the reward structure. To represent the reward rate
of the system at time t, we let $X(t) = r_{Z(t)}$. Finally, we let $p_i(t)$ denote $P[Z(t) = i]$, the
probability that the system is in state i at time t. The state probability vector p(t) may be
computed by solving a matrix differential equation [26],

$$\frac{d}{dt}\,p(t) = p(t)\,Q .$$
Methods for computing p(t) are compared in [48].
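One common way to evaluate p(t) numerically is uniformization (also called randomization or Jensen's method), which rewrites p(t) as a Poisson-weighted sum of the step probabilities of a discrete-time chain. The sketch below is a minimal pure-Python version under simplifying assumptions: states are indexed from 0 rather than 1, the function name and the tolerance parameter are hypothetical, and the plain left-to-right summation is only suitable for moderate values of the uniformization rate times t. It is offered as an illustration, not as the method recommended in [48].

```python
import math


def transient_probabilities(Q, p0, t, tol=1e-12):
    """Approximate p(t) = p(0) exp(Qt) by uniformization.

    Q  : generator matrix as a list of rows (each row sums to zero)
    p0 : initial state probability vector p(0)
    t  : time at which the state probabilities are wanted
    """
    n = len(Q)
    big_lambda = max(-Q[i][i] for i in range(n))  # uniformization rate
    if big_lambda == 0.0 or t == 0.0:
        return list(p0)

    # Transition matrix of the uniformized discrete-time chain, P = I + Q/Lambda.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / big_lambda for j in range(n)]
         for i in range(n)]

    weight = math.exp(-big_lambda * t)  # Poisson(Lambda*t) probability of k = 0
    v = list(p0)                        # holds p(0) P^k
    result = [weight * x for x in v]
    covered = weight                    # Poisson mass summed so far

    # Simple left-to-right summation; adequate for moderate Lambda*t only,
    # since exp(-Lambda*t) underflows when Lambda*t is very large.
    max_terms = int(big_lambda * t + 10.0 * math.sqrt(big_lambda * t) + 50)
    for k in range(1, max_terms + 1):
        if 1.0 - covered <= tol:
            break
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= big_lambda * t / k
        for j in range(n):
            result[j] += weight * v[j]
        covered += weight
    return result
```

For a two-state failure/repair chain the result can be checked against the well-known closed-form solution, which is a convenient sanity test before applying the routine to larger models.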

A fundamental question about any system is simply, “What is the probability of completing a
given amount of useful work within a specified time interval?” We let Y(t) be the accumulated
reward until time t, that is, the area under the X(t) curve,

$$Y(t) = \int_0^t X(\tau)\,d\tau .$$

The value of Y(t) is the amount of reward accumulated by a system during the interval (0, t).
Consequently, by interpreting rewards as performance levels, we see that the distribution of
accumulated reward is at the heart of characterizing systems that evolve through states with
different reward rates (e.g., performance levels). In Figure 1 we depict a Markov reward
model with a 3-state CTMC for the structure-state process and a simple reward structure,
the transition rate matrix of the CTMC, as well as sample paths for the stochastic processes
Z(t), X(t) and Y(t). Note that a given sample path of Z(t) determines unique sample paths
for X(t) and Y(t).
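The way a sample path of Z(t) fixes the paths of X(t) and Y(t) can be sketched with a short simulation. The code below is illustrative only: the function name, the 0-based state indexing, and the 3-state generator and reward vector in the usage example are assumptions, not the model of Figure 1.

```python
import random


def simulate_accumulated_reward(Q, r, z0, t_end, rng):
    """Simulate one sample path of Z on (0, t_end).

    Returns (path, y): path is a list of (entry_time, state) pairs and
    y is the accumulated reward Y(t_end), the area under X(t) = r[Z(t)].
    """
    n = len(Q)
    t, z, y = 0.0, z0, 0.0
    path = [(0.0, z0)]
    while True:
        exit_rate = -Q[z][z]
        # Exponential holding time; an absorbing state is never left.
        hold = rng.expovariate(exit_rate) if exit_rate > 0.0 else float("inf")
        if t + hold >= t_end:
            y += r[z] * (t_end - t)
            return path, y
        y += r[z] * hold
        t += hold
        # Pick the next state j != z with probability Q[z][j] / exit_rate.
        u = rng.random() * exit_rate
        cumulative = 0.0
        candidates = [j for j in range(n) if j != z and Q[z][j] > 0.0]
        next_state = candidates[-1]  # fallback guards against rounding
        for j in candidates:
            cumulative += Q[z][j]
            if u < cumulative:
                next_state = j
                break
        z = next_state
        path.append((t, z))


if __name__ == "__main__":
    # Hypothetical 3-state model: reward rates 2, 1 and 0.
    Q = [[-0.2, 0.2, 0.0],
         [1.0, -1.1, 0.1],
         [0.0, 0.5, -0.5]]
    r = [2.0, 1.0, 0.0]
    path, y = simulate_accumulated_reward(Q, r, 0, 10.0, random.Random(42))
    print("state changes:", path)
    print("Y(10) =", y)
```

Repeating the simulation many times and recording Y(t) gives an empirical estimate of the distribution of accumulated reward, at the price of sampling error.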

We denote the distribution of accumulated reward at time t evaluated at x as:

$$\mathcal{Y}(x,t) = P[\,Y(t) \le x\,] .$$

When the CTMC Z(t) has one or more absorbing states with a zero reward rate, we may
also wish to compute the distribution of accumulated reward until absorption,

$$\mathcal{Y}(x,\infty) \equiv P[\,Y(\infty) \le x\,] .$$

The time average of Y(t) and the distribution of the time-averaged accumulated reward are
denoted as:

$$W(t) = \frac{1}{t}\int_0^t X(\tau)\,d\tau \quad\text{and}\quad \mathcal{W}(x,t) = P[\,W(t) \le x\,] .$$
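Because W(t) is Y(t) rescaled by the interval length, the two distributions carry the same information. The short derivation below makes this explicit and records bounds that follow from the reward rates alone. It assumes t > 0, writes $\mathcal{Y}$ and $\mathcal{W}$ for the distributions of Y(t) and W(t), and uses $r_{\min}$ and $r_{\max}$ (names introduced here for illustration) for the smallest and largest reward rates.

```latex
\begin{align*}
\mathcal{W}(x,t) &= P[\,W(t) \le x\,] = P\!\left[\tfrac{1}{t}\,Y(t) \le x\right]
                  = P[\,Y(t) \le x t\,] = \mathcal{Y}(xt,\,t), \\
r_{\min}\,t &\le Y(t) \le r_{\max}\,t
  \quad\Longrightarrow\quad
  \mathcal{Y}(x,t) = 0 \ \text{for } x < r_{\min} t,
  \qquad
  \mathcal{Y}(x,t) = 1 \ \text{for } x \ge r_{\max} t .
\end{align*}
```

In particular W(t) always lies between $r_{\min}$ and $r_{\max}$ regardless of t, which is what makes it convenient when intervals of different length are compared.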

The distribution of time-averaged accumulated reward is particularly useful for comparing 
the behavior of a system over time intervals of different length. To complete our notation, we 
note that we have assumed a distinguished initial state. To explicitly indicate this dependence 
on the initial state we will use a subscript on cumulative and time-averaged random variables 
and their distributions. For example, $Y_i(t)$ denotes the accumulated reward for the interval
(0, t) given that the initial state is i (i.e., Z(0) = i).

In the special case when we assign a reward rate 1 to operational states and zero to
non-operational states, the expected reward rate at time t, E[X(t)], is known as the instantaneous
or point availability A(t), the expected reward rate in the steady-state, E[X(∞)], is called
the steady-state availability A(∞), and W(t) is called the interval availability $A_I(t)$.
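For the simplest availability model these quantities have closed forms, which the sketch below evaluates. It assumes a hypothetical two-state system (up with reward rate 1, down with reward rate 0) that starts in the up state, fails at rate `lam` and is repaired at rate `mu`; the function names are illustrative. Note that the last function returns only the expected value of the interval availability, not its distribution.

```python
import math


def point_availability(t, lam, mu):
    """A(t) = E[X(t)] for a two-state model that starts in the up state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)


def steady_state_availability(lam, mu):
    """A(infinity), the limit of A(t) as t grows."""
    return mu / (lam + mu)


def expected_interval_availability(t, lam, mu):
    """E[W(t)] = (1/t) * integral of A(u) du over (0, t), for t > 0."""
    s = lam + mu
    return mu / s + lam * (1.0 - math.exp(-s * t)) / (s * s * t)


if __name__ == "__main__":
    lam, mu = 0.01, 1.0  # illustrative failure and repair rates
    for t in (0.1, 1.0, 10.0, 100.0):
        print(t, point_availability(t, lam, mu),
              expected_interval_availability(t, lam, mu))
    print("steady state:", steady_state_availability(lam, mu))
```

Since the system starts in the up state, A(t) decreases toward A(∞), so the expected interval availability over (0, t) is never smaller than the point availability at t.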

For a more complete description of the historical development, notation, measures and models 
see [60]. 

Figure 1: 3-State Markov Reward Model with Sample Paths of Z(t), X(t) and Y(t) Processes.

Markov models have been used for the reliability and availability analysis of computer/communication
systems [52, 55, 63]. More recently, Markov reward models have been used for the combined
evaluation of performance and reliability [3, 6, 11, 29, 39, 58]. Our exposition of Markov
reward models used them not only in the combined evaluation of performance and reliability
but in many other problems of computer/communications systems analysis. Until recently,
distributions of cumulative measures and their time-averages were only obtainable for small
or special Markovian systems. The use of Markov reward models extends our ability to model
such systems, and with the algorithms in [58, 50] we have obtained new and useful results.
We have illustrated the wide applicability of Markov reward models and the effectiveness
of our algorithm with a variety of examples in the area of computer systems analysis. By
interpreting the structure-state process as the failure and repair behavior of components and
the reward structure as the ability of the system to render useful service, we obtain
performability measures of practical interest such as the distribution of accumulated reward or
the completion time distribution, depending on whether the time or reward requirement is
fixed. If we interpret the structure-state process as characterizing the arrival and departure
behavior of tasks in a queueing system, interpret the reward structure as the number of jobs
in the queue, and fix the time interval considered, then we obtain the time-averaged queue
length distribution. If we interpret the structure-state process as delineating the arrival and
departure behavior of tasks in a queueing system, interpret the reward structure as the
portion of service rendered to a ‘tagged’ job, and fix the reward requirement, then we obtain the
response time distribution of an M/M/1/k/PS queueing system.
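To make the queueing interpretation concrete, the sketch below builds the generator matrix and reward vector for an M/M/1/k queue, taking the reward rate of a state to be the number of jobs present. The function names, the 0-based state numbering, and the choice to count all jobs in the system are assumptions made for illustration. The second function only evaluates the steady-state mean of the reward; obtaining the time-averaged queue length distribution requires the algorithms cited above.

```python
def mm1k_reward_model(arrival_rate, service_rate, k):
    """Return (Q, r) for an M/M/1/k queue with states 0..k jobs present."""
    n = k + 1
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < k:
            Q[i][i + 1] = arrival_rate   # arrival accepted while below capacity
        if i > 0:
            Q[i][i - 1] = service_rate   # departure of the job in service
        Q[i][i] = -sum(Q[i])             # diagonal makes the row sum to zero
    r = [float(i) for i in range(n)]     # reward rate = number of jobs present
    return Q, r


def steady_state_mean_queue_length(arrival_rate, service_rate, k):
    """Expected steady-state reward rate, sum_i r_i * pi_i."""
    rho = arrival_rate / service_rate
    weights = [rho ** i for i in range(k + 1)]  # pi_i is proportional to rho^i
    total = sum(weights)
    _, r = mm1k_reward_model(arrival_rate, service_rate, k)
    return sum(r_i * w / total for r_i, w in zip(r, weights))


if __name__ == "__main__":
    Q, r = mm1k_reward_model(arrival_rate=0.8, service_rate=1.0, k=5)
    for row in Q:
        print(row)
    print("rewards:", r)
    print("mean queue length:", steady_state_mean_queue_length(0.8, 1.0, 5))
```

Swapping the reward vector while keeping the same generator is all that is needed to move between interpretations of the same structure-state process.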

As the examples in papers [58, 60] show, the results can be used to make quantitative
statements about the ability of computer systems to complete fixed amounts of work in a given
time interval. The next few sections introduce the notion of a critical workload, and use a
well characterized workload distribution to obtain critical workload completion
probabilities.


3 Modeling and Critical Workloads 


The design process in the development of a computer system is an expensive and long 
term endeavor. For aerospace applications the reliability of the computer system is essential, 
as is the ability to complete critical workloads in a well defined real time interval.
Consequently, effective modeling of such systems must take into account both performance and
reliability. The early use of models in the design of such complex computer systems can
substantially improve the quality of the final result, as well as decrease costs. Whether a
model is analytic, a simulation, or a prototype, a well constructed model will yield insight
into the functional capabilities of the components and their effect on the system as a whole. 

Often models can be used to find weaknesses and errors early in the design process, where 
they can be most easily and inexpensively rectified. 

We view a computer system as having three levels:

• hardware resources 

• an operating system to correctly and efficiently manage the hardware resources 

• applications that request resources in order to compute “responses”. 

The conceptual environment of the critical workloads (applications whose timely completion
is essential to continued safe operation) is shown in Figure 2.

The completion time of an application depends on the ability of the hardware and operating
system to meet the resource requirements of the application in a timely fashion. Often, the
performance bottleneck (i.e., limiting) requirements are memory access speeds and floating
point computation speeds. On a loosely coupled or distributed system where the workload is
composed of life critical, mission critical and non-critical applications, there must be resources
to support the completion of a worst case conjunction of life and mission critical tasks within
their real time deadlines. In order to do this effectively, the critical applications often need
to interrupt less important applications in order to complete before their deadlines.

Figure 2: Critical workload arriving to interrupt less important processing.

The central question in such an environment is simply: 

“What is the probability of completing critical workload x by real time deadline t?”

There are three broad areas that must be well characterized in order to determine an
accurate answer to this question:

• the behavior of the system during the real time interval

• the status of the system when the critical workload arrives

• the resource requirements of the critical workload

In the next few sections the methodology for characterizing the behavior of the computer 
system with Markov Reward Models is developed, then we look at the characteristics of 
different kinds of critical workloads and conclude with an indication of the steps needed to 
obtain the probability of completing a critical workload. 

4 Critical Workload Characteristics 


A computer system has a set of resources (e.g., processing elements (PEs), memories, and
communication capabilities to connect them). Certainly the state of the system (number
of operational resources of various types) characterizes the performance capabilities of the
system. The hardware provides the basic facilities and the operating system allocates the 
resources as needed to applications. 

The critical factors in the design of computer systems for the last decade have been the
ability to access and process data. In many ways, the evolution of computer architectures
has been an account of new ways to remove limitations on both these capabilities. Most
computer applications fall into one of two classes of computation.

Computational programs (e.g., scientific, engineering control) take parameters and
perform arithmetic/logical operations with them to produce a result that is expressed in a
few numbers.

Data-based programs, on the other hand, access large data sets to gain limited information
which is then read/modified. Examples would be updating the coverage of an insurance policy
in a large data base or updating a few elements of a large array.

Most programs tend to be either in one class or the other. Real time control programs 
tend to generate computational workloads, which suggests CPU rate (MIPS or FLOPS) as
an appropriate performance measure. Large-scale scientific computations in areas such as 
high speed vehicle design and structural/electronic/optical design or testing tend to involve 
large arrays and as such take on some of the characteristics of data-based programs, making 
memory bandwidth an important consideration as well. In cases where the performance 
bottleneck is not clear, a more detailed examination of the performance characteristics of the 
system under the types of workload in question can be used to determine performance levels. 

Interrupts ensure that critical workloads will receive immediate attention, thereby making
the delay until a critical workload is serviced very small. The small delay can be taken into
account by lowering the deadline time t. In such a situation, the performance bottleneck of
a critical real time control workload will be CPU rate.

Clearly the computational requirements of a critical workload will strongly affect the
probability of completing it by a real time deadline. To some extent the size of the critical
workload will depend on the input that generated it. A conservative assumption would be
that the critical workload resulted from a worst case set of inputs. If more information on the 
critical workload distribution is available, a more accurate determination of the probability 
of completing it by a real time deadline would be possible. 

Therefore let us define the critical workload size distribution through its density function,

$$B(x) = P[\,\text{critical workload size} = x\,] .$$

The conservative worst case approach for a maximum workload, $\theta$, can be represented as
$B(x) = \delta(x - \theta)$, setting the probability that x is $\theta$ to 1. Other critical workload models might
include a normal distribution, since in the absence of hard data the central limit theorem gives
some justification for this model. Of course empirical distributions obtained from running
the critical workload on representative portions of the input space could be used as well.

If we regard the critical workload as the sum of the instructions executed for a given set of
input data, then we would expect the size of the critical workload, x, to be approximately
normally distributed because of the central limit theorem. Note, however, that the central limit
theorem assumes the independence of the random variables (instruction workloads) to show
the asymptotically normal behavior of the sum, x. For many applications there
is a control portion of the code that takes the data as input and chooses the appropriate
execution path given the data. Often this is done with an eye to using very efficient methods
where possible. A consequence of the presence of this upper level control structure is that
the workload distribution will be multi-modal. A reasonable quantitative characterization of
the improvement of the completion time distribution resulting from implementation of highly
efficient solution methods for a subset of the possible inputs is a valuable contribution.
5 Critical Workload Completion Probabilities


In section 2 we introduced Markov reward models and the complementary distribution
of accumulated reward $\mathcal{Y}^c(x,t)$, and a method to obtain p(0), the initial state probability
vector of the Markov reward model. In section 4 we discussed several simple densities of
critical workload size, B(x). To obtain the unconditional probability of completing a critical
workload (CW) with density B(x), we need only uncondition over x, thus:

$$P[\,\text{CW completes by } t\,] = \int_0^\infty \mathcal{Y}^c(x,t)\,B(x)\,dx . \qquad (1)$$

For a conservative estimate of the workload completion probability, $B(x) = \delta(x - \theta)$ and
p(0) is such that P[min configuration] = 1. This reduces equation (1) to $\mathcal{Y}^c(\theta,t)$ with p(0)
such that the configuration when the job arrives is the minimal operational configuration. A
refined estimate is possible by more realistically characterizing the workload density, B(x),
and more accurately determining p(0) at the time the critical workload arrives.
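Equation (1) can also be approximated by simulation, which gives a quick cross-check of the conservative and refined estimates. The sketch below is a Monte Carlo illustration under loudly stated assumptions: a hypothetical two-configuration system (full and minimal operational), illustrative rates and reward rates, a workload size drawn independently of the system's evolution, and completion taken to mean Y(t) >= x. All function and parameter names are invented for the example, and this is not the numerical method used to obtain the results in [59].

```python
import random


def sample_accumulated_reward(t_end, lam, mu, r_full, r_min, start_full, rng):
    """One sample of Y(t_end) for a hypothetical two-configuration system.

    The system alternates between a full configuration (reward rate r_full,
    left at failure rate lam) and a minimal operational configuration
    (reward rate r_min, left at repair rate mu).
    """
    t, y, full = 0.0, 0.0, start_full
    while True:
        hold = rng.expovariate(lam if full else mu)
        rate = r_full if full else r_min
        if t + hold >= t_end:
            return y + rate * (t_end - t)
        y += rate * hold
        t += hold
        full = not full


def completion_probability(sample_workload, t_end, n_runs, rng, **model):
    """Monte Carlo estimate of P[CW completes by t_end].

    Each run draws a workload size x from B (via sample_workload) and an
    independent sample of Y(t_end); the workload completes if Y(t_end) >= x.
    """
    completed = 0
    for _ in range(n_runs):
        x = sample_workload(rng)
        if sample_accumulated_reward(t_end, rng=rng, **model) >= x:
            completed += 1
    return completed / n_runs


if __name__ == "__main__":
    rng = random.Random(0)
    model = dict(lam=0.05, mu=0.5, r_full=2.0, r_min=1.0)
    theta = 15.0  # hypothetical worst-case workload size
    # Conservative estimate: worst-case workload, minimal configuration at arrival.
    worst = completion_probability(lambda g: theta, 10.0, 20000, rng,
                                   start_full=False, **model)
    # Refined estimate: normally distributed workload (truncated at zero),
    # full configuration at arrival.
    refined = completion_probability(lambda g: max(0.0, g.gauss(12.0, 2.0)),
                                     10.0, 20000, rng, start_full=True, **model)
    print("conservative:", worst, "refined:", refined)
```

The estimate carries sampling error that shrinks only with the square root of the number of runs, so it is best used as a sanity check rather than as a replacement for a numerical solution of the reward model.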

Where the consequences of failure to complete critical workloads by a real time deadline are
grave and the costs of ensuring timely completion are high, it is essential that effective tools
to analyze the situation be developed to make the most of available hardware and human
resources in the development and production of high performance fault tolerant systems. For
results using this methodological approach and several interesting examples see [59].


6 Conclusion 


Four interesting models are developed, from which we obtain the following distributions:
multiprocessor performability, task completion time in a failure prone environment (a semi-Markov
model), the response time in a processor sharing disciplined queueing system, and the time
averaged queue length for an M/M/1/k queue. The final model is also used as an example to
indicate hitherto unknown dynamic behavior of the M/M/1/k queue (Sect. 5.1 in [60]).

Workload characterization and the completion time distribution of workloads in various
environments are then examined in [59].

Further details on the computational aspects of Markov reward models are available in [48, 
49, 50, 57, 58]. My current interests also include using approximation techniques, such as
those indicated in [1], to improve the computational efficiency and accuracy of the hyperbolic
PDE performability equation. Computation of the distribution of task completion time with
a possible loss of work upon failure is treated in [8, 10, 36, 37, 42]. The question of the 
generation of the Markov models for large systems is addressed in [2, 11, 20, 22, 51]. 